
[feat] Allow middle checkpoint evaluation in background using lmms-eval http server#127

Merged
kcz358 merged 14 commits into main from dev/eval_checkpoint
Jan 12, 2026

Conversation

@kcz358
Collaborator

@kcz358 kcz358 commented Jan 9, 2026

Motivation

Modifications

Commit Message Convention

Please follow our standardized commit message format:

  • [feat] - New features or functionality
  • [fix] - Bug fixes
  • [docs] - Documentation changes only
  • [style] - Code style changes (formatting, missing semicolons, etc.)
  • [refactor] - Code refactoring without changing functionality
  • [perf] - Performance improvements
  • [test] - Adding or updating tests
  • [chore] - Maintenance tasks, dependency updates, etc.
  • [ci] - CI/CD configuration changes

Examples:

  • [feat] add qwen omni iterable dataset support
  • [fix] resolve bagel model configuration error
  • [docs] update training guide with YAML examples

See CONTRIBUTING.md for more details.

CI/CD Checks

Your PR will automatically run the following checks:

  • Linting: Code formatting with black (line-length=120) and import sorting with isort
  • Run pre-commit run --all-files locally to verify before pushing

Checklist

  • Follow commit message convention (see above)
  • Run pre-commit run --all-files and ensure all checks pass
  • Format your code with black (line-length=120) and isort
  • Add unit tests for new functionality
  • Update documentation as needed, including docstrings or example tutorials
  • Ensure all CI/CD checks pass

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cdb5738e4d


Comment on lines +268 to +271
logger.info("Waiting for pending evaluation jobs to complete...")
while len(self.eval_backend.pending_evals) > 0:
for eval_step, metrics in self.eval_backend.check_and_get_completed():
if rank == 0:


P2: Drain last eval results after pending jobs finish

When wait_until_complete=True, the loop condition only checks pending_evals. If the final job completes between the last check_and_get_completed() call and the next while condition check, the worker thread will move the job into results_queue and remove it from pending_evals, causing the loop to exit without ever logging that last result. This drops the final metrics at the end of training; you can fix this by draining the queue once more after the loop, or by looping until both pending_evals and the results queue are empty.
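A minimal sketch of the suggested fix, continuing the snippet above (log_eval_metrics is a hypothetical stand-in for whatever the trainer actually does with the results):

logger.info("Waiting for pending evaluation jobs to complete...")
while len(self.eval_backend.pending_evals) > 0:
    for eval_step, metrics in self.eval_backend.check_and_get_completed():
        if rank == 0:
            log_eval_metrics(eval_step, metrics)  # hypothetical logging helper
# Drain the results once more: a job that finishes between the last poll and
# the loop condition check would otherwise be dropped without being logged.
for eval_step, metrics in self.eval_backend.check_and_get_completed():
    if rank == 0:
        log_eval_metrics(eval_step, metrics)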


Comment on lines +97 to +99
for model_state_shard in shard_state_dicts:
tensor = model_state_shard.pop(key)
state_dict[key].append(tensor._local_tensor.bfloat16())


P2: Preserve original dtype when consolidating shards

The merger unconditionally converts each shard tensor to bfloat16 before concatenation. That will silently downcast checkpoints trained in fp32 or fp16, which can degrade accuracy or break downstream assumptions about dtype. Since this is a merge utility, it should preserve the original dtype from the shards rather than forcing bfloat16.
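A minimal sketch of the dtype-preserving variant, reusing the names from the snippet above:

for model_state_shard in shard_state_dicts:
    tensor = model_state_shard.pop(key)
    # Keep the shard's original dtype (fp32/fp16/bf16) rather than forcing
    # .bfloat16(); any downcast can then be an explicit, opt-in step.
    state_dict[key].append(tensor._local_tensor)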


@kcz358
Collaborator Author

kcz358 commented Jan 9, 2026

First, install lmms-eval to use the eval client. A full installation is not needed since only the client side is used:

cd /path/to/lmms-eval
uv pip install --no-deps . (or -e . for editable)
uv pip install fastapi uvicorn

Add the eval config to the training YAML:

  eval_config:
    server_url: "http://192.168.8.249:8000"
    poll_interval: 10.0
    checkpoint_key: "model"
    checkpoint_type: "regular"
    num_gpus: 8
    batch_size: 256

The eval results will be logged to wandb. A few things to note:

  1. Since the training and evaluation sides are disaggregated, a checkpoint must be saved before it can be evaluated, so the save steps need to be equal to the eval steps (see the sketch after this list).
  2. You might need to keep the intermediate checkpoints (set a higher save total limit) so that they are not cleaned up before they have been evaluated.
  3. For now, the eval server must share a file storage system with the training system.
  4. The evaluation runs quietly in the background.
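To make notes 1 and 2 concrete, a sketch of the related trainer settings; save_steps, eval_steps, and save_total_limit are common trainer-style names for the options the notes refer to and may not match this repo's exact config keys:

  # Hypothetical key names for illustration only; check the repo's actual config schema.
  save_steps: 1000        # a checkpoint must exist before each eval (note 1)
  eval_steps: 1000        # keep equal to save_steps
  save_total_limit: 20    # high enough that checkpoints are not cleaned up before eval (note 2)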

@kcz358 kcz358 merged commit 3349d7b into main Jan 12, 2026
2 checks passed
@kcz358 kcz358 deleted the dev/eval_checkpoint branch January 12, 2026 02:01
kcz358 added a commit that referenced this pull request Jan 12, 2026
[feat] Allow middle checkpoint evaluation in background using lmms-eval http server (#127)

* rfc ema utils so that the attribute is being retrieved after the first init

* [feat] Add FSDP2 checkpoint merger module

Add utilities for merging sharded FSDP2 checkpoints into single consolidated checkpoints for evaluation and inference. Includes base class and FSDP2 implementation with support for both regular and EMA checkpoints.

* [feat] Add eval server backend for asynchronous checkpoint evaluation

* [feat] Integrate eval server backend into FSDP2 trainer

* [feat] Add eval optional dependency with httpx

* [feat] Add lmms_engine_kwargs support for checkpoint merging

* [feat] Pass checkpoint_type to eval backend in validation_step

* [feat] Update version and config for eval/EMA features

* [fix] Fix EvalClient import and add eval_output_dir parameter

* [refactor] Remove output_dir and check_interval from EvalConfig

* [feat] Add eval_strategy check and wait for eval completion

* [feat] Define global_step as step_metric for eval metrics in wandb

* [feat] Use global_step in metrics for eval results logging

* [docs] Add async eval guide and update merge FSDP documentation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
